This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

add Falcon 40B model support #368

Merged
merged 13 commits into rustformers:main on Jul 27, 2023

Conversation

skirodev
Contributor

This PR has already passed tests with the Falcon mini-series models, but due to my device's limitations, I haven't tested it with the original Falcon series models. #293

@philpax philpax requested a review from LLukas22 July 15, 2023 10:54
@philpax
Collaborator

philpax commented Jul 15, 2023

Looks promising, I'll leave it to Lukas to check the specifics.

Is the model definition/file format beginning to stabilise a little? Part of the reason it's still experimental is because, as far as I can tell, upstream is still figuring out the specifics of the implementation.

@skirodev
Contributor Author

skirodev commented Jul 15, 2023

@philpax Yeah, perhaps it should remain in an experimental state until it undergoes all necessary tests and stabilizes.

@LLukas22
Contributor

@skirodev At first glance this looks good. How did you create your ggmlv3 versions of the model? I would like to try 7B- and 40B-instruct before I merge this.

@skirodev
Contributor Author

@LLukas22 I just uploaded the conversion code (only for f16 and f32) to Hugging Face. The quantization code isn't complete yet.

@LLukas22
Contributor

Thanks! You can probably just use llm to quantize your converted models; it should support the normal Q-quants if you implemented the hyperparams correctly.
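For context, the "hyperparams" here are the header fields that llm reads from the front of a ggml model file before the tensor data. A minimal sketch of decoding such a header follows; the field names, order, and sample values are illustrative only, not the actual Falcon/ggml layout:

```rust
// Hypothetical sketch of parsing ggml-style hyperparameters from a
// little-endian byte buffer. Field names/order are illustrative only.
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
struct FalconHyperparameters {
    n_vocab: u32,
    n_embd: u32,
    n_head: u32,
    n_head_kv: u32, // Falcon 40B uses grouped KV heads; 7B uses 1
    n_layer: u32,
    file_type: u32, // quantization format tag, e.g. q4_0
}

// Read one little-endian u32 and advance the offset.
fn read_u32(buf: &[u8], off: &mut usize) -> u32 {
    let v = u32::from_le_bytes(buf[*off..*off + 4].try_into().unwrap());
    *off += 4;
    v
}

fn parse_hyperparameters(buf: &[u8]) -> FalconHyperparameters {
    let mut off = 0;
    FalconHyperparameters {
        n_vocab: read_u32(buf, &mut off),
        n_embd: read_u32(buf, &mut off),
        n_head: read_u32(buf, &mut off),
        n_head_kv: read_u32(buf, &mut off),
        n_layer: read_u32(buf, &mut off),
        file_type: read_u32(buf, &mut off),
    }
}

fn main() {
    // Encode a sample header and round-trip it.
    let fields: [u32; 6] = [65024, 8192, 128, 8, 60, 2];
    let buf: Vec<u8> = fields.iter().flat_map(|f| f.to_le_bytes()).collect();
    let hp = parse_hyperparameters(&buf);
    assert_eq!(hp.n_embd, 8192);
    assert_eq!(hp.n_head_kv, 8);
    println!("{:?}", hp);
}
```

If a converter writes these fields in a different order or width than the loader expects, every downstream tensor offset shifts, which is one way a "successful" load can still produce gibberish.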

@skirodev
Contributor Author

Thanks for the suggestion! I'll try using llm to quantize the converted models and see how it works.

@skirodev
Contributor Author

The performance of the falcon-7b-instruct-ggml-q4_0.bin model is shown below:

cargo run --release -- infer --no-float16 -n 256 -a falcon -m "./models/falcon-7b-instruct-ggmlv3-q4_0.bin" --batch-size 512 -p "write a story about falcon" --stats

[screenshot: inference output and stats for falcon-7b-instruct-ggml-q4_0]

@LLukas22
Contributor

Alright, I used your script to create ggml versions of falcon-7B and falcon-40B and quantized them with llm to q4_0.

7B works as expected. Output:

write a story about falconry
About
Portfolio
Contacts
For all that is left in the forest, hunt down foxes, wolves, rabbits, and even birds of prey. The first is the hunt, the second is the capture of the bird, and the third is the training of the hawk. Falconers also breed hawks, and some birds may pass through more than one set of hands before they are considered ready for sale or use as a pet. However, you need to be an experienced bird owner before you start a hunt. There are three main types of hunting hawks: lure, live lure, and lure/live. Falconers are trained in many different types of birds, each with its own set of specialized handling techniques and training methods. Falconers hunt with two birds - one as a hawk and the other as a lure bird. For example, falconers use a lure hawk to hunt with live birds and lure hawks to hunt with lure birds. The lure hawk uses birds that are already wild and free to hunt for its own food, whereas the live hawk hunts live birds that are kept in captivity, either for the purpose of training or for hunting purposes. The birds that hunt with lure birds hunt birds like pigeons, doves, and prairie

Sadly 40B produces gibberish 😞. Output:

write a story about falcon/ it or  and  I your body, on the all of 
 not of the  it  new/  can a all you  the so  so. more- this. our past- so in: so I to so you  we'. in  we a  you  it a no.’ your people should a I of it,  by and, and  all the I  to, of

 that what is not all a. for  I can new
 for  in an and we you from all:  you may in of  all
/ to  - D

I don't know if the inference code is the problem or if the conversion-script/quantization corrupted some tensors.

Review comments were left on crates/ggml/src/context.rs, crates/llm/Cargo.toml, and crates/models/falcon/src/lib.rs (resolved).
@skirodev
Contributor Author

> Sadly 40B produces gibberish 😞. […] I don't know if the inference code is the problem or if the conversion-script/quantization corrupted some tensors.

My apologies, could you retest it now?

@LLukas22
Contributor

     Running `target\release\llm.exe infer --no-float16 -n 256 -a falcon -m C:\Users\lkreu\Desktop\falcon\falcon-40b-q4_0 --batch-size 512 -p "write a story about falcon" --stats -r tiiuae/falcon-40b`
⣟ Loading model...[2023-07-16T10:54:11Z INFO  cached_path::cache] Cached version of https://huggingface.co/tiiuae/falcon-40b/resolve/main/tokenizer.json is up-to-date
✓ Loaded 484 tensors (23.5 GB) after 750ms
write a story about falcon, our you this the . more to  not and
 on your so the- you.  I. it. we may a so it for that or any
 the and, and for more the and  we  I by no  so in and just   of our- a can of. all,? I it  at the 
 so,  is it of my personal to that. 
 the and by is it

It got a bit better but there probably still is something wrong. Maybe we should wait for the ggml update and revisit it then?

@skirodev
Contributor Author

> It got a bit better but there probably still is something wrong. Maybe we should wait for the ggml update and revisit it then?

Absolutely, I agree. Let's wait for the ggml update and take another look then. Thank you for testing.

@LLukas22
Contributor

Could you try to pull the latest main into this? It should contain the latest ggml version.

@skirodev
Contributor Author

skirodev commented Jul 17, 2023

> Could you try to pull the latest main into this? It should contain the latest ggml version.

Sure, I already pulled the latest main branch for the updated ggml version.

@philpax philpax requested a review from LLukas22 July 17, 2023 10:10
@LLukas22
Contributor

Falcon 40B still produces gibberish:

PS F:\Github\llm-main> cargo run --release --features falcon -- infer -a falcon -p "Tell me a story about a falcon" -m "C:\Users\lkreu\Desktop\falcon\falcon-40b-q4_0" --no-float16
    Finished release [optimized] target(s) in 0.24s
     Running `target\release\llm.exe infer -a falcon -p "Tell me a story about a falcon" -m C:\Users\lkreu\Desktop\falcon\falcon-40b-q4_0 --no-float16`
✓ Loaded 484 tensors (23.5 GB) after 114ms
Tell me a story about a falcon . more to or- they in, you your. de/ /   to  they.: 
  you, I so good more on to one. on a for all  I  so,.  i  or  and - and it  that  :: and la

Maybe I need to reconvert/quantize my model 🤔

@skirodev
Contributor Author

After further testing, Falcon 40B now infers successfully.

cargo run --release -- infer -a falcon -m "./models/falcon-40b-instruct-ggmlv3-q4_0.bin" --batch-size 512 -p "write a story about falcon" -r tiiuae/falcon-40b --stats

The majestic bird of prey soared through the sky, its wingspan stretching outwards as it searched for prey. Its sharp 
eyes scanned the horizon, and in an instant, it spotted movement below. With powerful strokes of its wings, it dove 
towards its target at incredible speeds before striking with lightning-fast precision. The falcon was a symbol of 
strength, agility, and intelligence – an awe-inspiring creature that commanded respect from all who saw it soar above.

However, the embedded tokenizer code still needs modification, as Falcon does not require adding a BOS token id and has some special tokens; this may depend on the implementation of the GGUF format.
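The BOS concern above can be sketched as follows. This is a hypothetical helper, not the actual llm tokenizer API: some architectures (e.g. LLaMA) prepend a BOS token id to the prompt, while Falcon-style models should not.

```rust
// Hypothetical sketch of architecture-conditional BOS handling.
// The real llm tokenizer API differs; ids and flag are illustrative.
fn encode_prompt(ids: &[u32], add_bos: bool, bos_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len() + 1);
    if add_bos {
        // LLaMA-style: prepend the BOS token id.
        out.push(bos_id);
    }
    out.extend_from_slice(ids);
    out
}

fn main() {
    let ids = [100, 200, 300];
    // LLaMA-style encoding: BOS prepended.
    assert_eq!(encode_prompt(&ids, true, 1), vec![1, 100, 200, 300]);
    // Falcon-style encoding: no BOS.
    assert_eq!(encode_prompt(&ids, false, 1), vec![100, 200, 300]);
    println!("ok");
}
```

Unconditionally prepending BOS for a model that was not trained with one is exactly the kind of silent mismatch that degrades output quality without producing an error.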

@LLukas22
Contributor

Good job 👍

I'll give this another look tomorrow, and if everything works I'm going to merge it.

@philpax
Collaborator

Code looks good! Will leave it to Lukas to do final tests but OK from my end. Hopefully we can get the additional information from GGUF soon.

@LLukas22
Contributor

LGTM 👍

Now it even works with fp16 memory :D

@LLukas22 LLukas22 merged commit 2259555 into rustformers:main Jul 27, 2023
@ghost

ghost commented Aug 2, 2023

Does it work with Metal?

@LLukas22
Contributor

LLukas22 commented Aug 2, 2023

@jempabroni Maybe, it depends on whether all the necessary operations have already been ported to Metal shaders. You can try it, and if it gives you an invalid operation error, it's not supported yet.

@hhamud hhamud mentioned this pull request Aug 7, 2023